Abstract

The problem of predicting a sequence $x_1, x_2, \ldots$ generated by a
discrete source with unknown statistics is considered. Each letter
$x_{t+1}$ is predicted using only the information in the word
$x_1 x_2 \cdots x_t$. This problem is of great importance for data
compression, because such predictions are used to estimate probability
distributions in PPM algorithms and other adaptive codes. On the other
hand, such prediction is a classical problem which has received much
attention; its history can be traced back to Laplace. We address the
problem where the sequence is generated by an i.i.d. source with some
large (or even infinite) alphabet. A method is presented for which the
redundancy of the code goes to 0 even for an infinite alphabet.
1 Introduction



The problem of prediction can be traced back to Laplace [SJ 123]. Presently,
the problem of prediction is investigated by many researchers because of
its practical applications and importance for probability theory, information
theory, statistics and other theoretical sciences, see jU HOI CHI 122 ESj.

Consider a source with unknown statistics which generates sequences
$x_1 x_2 \cdots$ of letters from an alphabet $A = \{a_1, \ldots, a_n\}$.
The underlying true distribution, which is unknown, is denoted by $p$.
Let the source generate a message $x_1 \ldots x_{t-1} x_t \ldots$,
$x_i \in A$ for all $i$, and let $\nu^t(a)$ denote the count of letter
$a$ occurring in the word $x_1 \ldots x_{t-1} x_t$. After the first $t$
letters $x_1, \ldots, x_{t-1}, x_t$ have been processed, the following
letter $x_{t+1}$ needs to be predicted. By definition, the prediction is
the set of non-negative numbers
$p^*(a_1|x_1 \cdots x_t), \ldots, p^*(a_n|x_1 \cdots x_t)$ which are
estimates of the unknown conditional probabilities
$p(a_1|x_1 \cdots x_t), \ldots, p(a_n|x_1 \cdots x_t)$, i.e. of the
probabilities $p(x_{t+1} = a_i|x_1 \cdots x_t)$; $i = 1, \ldots, n$.

Laplace suggested the following prediction:
$p^*_L(a|x_1 \cdots x_t) = (\nu^t(a) + 1)/(t + |A|)$, where $|A|$ is the
number of letters in the alphabet $A$. (The problem which Laplace
considered was to estimate the probability that the sun will rise
tomorrow, given that it has risen every day since the creation. Using
our terminology, we can say that Laplace estimated $p(r|rr \cdots r)$ and
$p(f|rr \cdots r)$, where $\{r, f\}$ is the alphabet ("sun rises", "sun
does not rise") and the length of $rr \cdots r$ is the number of days
since the creation.)
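
The Laplace rule is easy to try out. The following Python sketch is only an illustration: the function name `laplace_predict` and the choice of 10 observed sunrises are assumptions made here, not taken from the text.

```python
from collections import Counter
from fractions import Fraction


def laplace_predict(word, alphabet):
    """Return the estimate (count + 1) / (t + |A|) for every letter of the alphabet."""
    counts = Counter(word)
    t = len(word)
    return {a: Fraction(counts[a] + 1, t + len(alphabet)) for a in alphabet}


# Sunrise example with the alphabet {r, f}: the sun has risen on each of 10 days
# (the number 10 is an arbitrary illustrative choice).
prediction = laplace_predict("r" * 10, ["r", "f"])
print(prediction["r"], prediction["f"])  # 11/12 1/12
```

Exact fractions are used so that the estimates can be compared directly with the formula.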

We will estimate the precision of the prediction by the Kullback-Leibler
divergence between the true distribution $p(\cdot|x_1 \ldots x_t)$ and its
estimate $p^*_\gamma(\cdot|x_1 \ldots x_t)$, where $\gamma$ denotes a
prediction method.

In data compression the average divergence is called redundancy and plays
a key role. Let us comment on the relation to data compression in more
detail. An encoder can construct a code with codelength
$l^*(a|x_1 \cdots x_t) \approx -\log p^*(a|x_1 \cdots x_t)$ for any letter
$a \in A$ (the approximation may be as accurate as you like, if the
arithmetic code is used, see [H1II2]). An ideal encoder would base coding
on the true distribution $p$ and not on the prediction $p^*$. The
difference in performance (the redundancy) measured by the average code
length is given by

$$\sum_{a \in A} p(a|x_1 \cdots x_t)\,(-\log p^*(a|x_1 \cdots x_t))
 - \sum_{a \in A} p(a|x_1 \cdots x_t)\,(-\log p(a|x_1 \cdots x_t))
 = \sum_{a \in A} p(a|x_1 \cdots x_t) \log \frac{p(a|x_1 \cdots x_t)}{p^*(a|x_1 \cdots x_t)}.$$

So, we can see that from a mathematical point of view prediction and
adaptive coding are identical and can, therefore, be investigated together.
Note that such a scheme of adaptive (or universal) coding was suggested in
[3] and is now called a PPM algorithm.

It is known that the redundancy of the Laplace predictor is upper bounded
by $(|A| - 1)/(t + 1)$ if the predictor is applied to an i.i.d. source.
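
This bound can be checked numerically in a small case. The sketch below rests on assumptions made here for illustration: a binary alphabet, divergence measured with natural logarithms (so that the bound reads $(|A| - 1)/(t + 1)$ with no $\log e$ factor), and a grid of 21 source probabilities. The function name is hypothetical.

```python
from math import comb, log


def laplace_avg_redundancy_nats(p, t):
    """Exact average redundancy (in nats) of the Laplace predictor after t letters
    of a binary i.i.d. source with P(first letter) = p.

    The average over all words of length t is computed by summing over the
    count k of the first letter, which has a binomial distribution.
    """
    source = (p, 1.0 - p)
    total = 0.0
    for k in range(t + 1):
        weight = comb(t, k) * p**k * (1.0 - p) ** (t - k)
        estimate = ((k + 1) / (t + 2), (t - k + 1) / (t + 2))
        # Kullback-Leibler divergence; letters of probability 0 contribute nothing.
        kl = sum(q * log(q / est) for q, est in zip(source, estimate) if q > 0)
        total += weight * kl
    return total


for t in (1, 10, 100):
    bound = (2 - 1) / (t + 1)
    worst = max(laplace_avg_redundancy_nats(i / 20, t) for i in range(21))
    print(t, worst, bound, worst <= bound)
```

The check only samples the source probability on a grid, so it illustrates the bound rather than proving it.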

Krichevsky [SJ E] investigated the problem of the optimal minimax predictor
for i.i.d. sources and showed that, loosely speaking, the redundancy of the
optimal predictor is asymptotically equal to $(|A| - 1)/(2t) + o(1/t)$. We
can see that, on the one hand, the precision of the predictors essentially
depends on the alphabet size $|A|$. On the other hand, there are many
applications where the alphabet size is unknown beforehand and can only be
upper bounded. Moreover, quite often such a bound is infinity. That is why
the problems of prediction and, especially, adaptive coding for large and
infinite alphabet sources have been studied in the literature before, see
|%1 ITU IT51 ITU 123].

In this paper we suggest a scheme of adaptive coding (and prediction)
for the case where a source generates letters from an alphabet of unknown
(and even infinite) size. This scheme can be applied along with the Laplace,
Krichevsky and any other predictors. If the suggested scheme is applied to
an $s$-letter source and $s$ is unknown, the redundancy is asymptotically
the same as if the predictor were applied to an $(s + 1)$-letter source
with the alphabet size $(s + 1)$ known beforehand.

When the suggested scheme is applied to an infinite alphabet, the
redundancy of the code goes to 0 if, loosely speaking, the original
representation of the alphabet letters has a finite average word length.
It will be shown that, in fact, this condition is necessary for the
existence of such predictors.

We mainly consider the problem of prediction for i.i.d. sources, because
the predictors for i.i.d. sources are used as a "core" in the PPM scheme
and other practically used adaptive codes, and the performance of those
codes effectively depends on the redundancy of the i.i.d. predictors.
Besides, all results can be easily extended to Markov sources using
well-known methods, see, for example, [5] I2T].
2 Definitions and Preliminaries



Consider an alphabet $A = \{a_1, \ldots, a_n\}$ with $n \ge 2$ letters and
denote by $A^t$ the set of words $x_1 \cdots x_t$ of length $t$ from $A$.
Let $p$ be a source which generates letters from $A$. Formally, $p$ is a
probability distribution on the set of words of infinite length or, more
simply, $p = (p^t)_{t \ge 1}$ is a consistent set of probabilities over
the sets $A^t$; $t \ge 1$. By $M_0(A)$ we denote the set of Bernoulli (or
i.i.d.) sources over $A$.

Denote by $D(\cdot\|\cdot)$ the Kullback-Leibler divergence and consider a
source $p$ and a method $\gamma$ of prediction. The redundancy is
characterized by the divergence

$$r_{\gamma,p}(x_1 \cdots x_t)
 = D\big(p(\cdot|x_1 \cdots x_t)\,\|\,p^*_\gamma(\cdot|x_1 \cdots x_t)\big)
 = \sum_{a \in A} p(a|x_1 \cdots x_t) \log \frac{p(a|x_1 \cdots x_t)}{p^*_\gamma(a|x_1 \cdots x_t)}. \qquad (1)$$

(Here and below $\log \equiv \log_2$.)
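
As a small numerical illustration of (1), the following Python sketch evaluates the divergence in bits for one prediction. The "true" distribution `p` below is a hypothetical choice made only for this example, and the function name is likewise an assumption.

```python
from math import log2


def redundancy_bits(p, p_star):
    """Divergence D(p || p_star) in bits, as in (1).

    Letters with p(a) = 0 contribute nothing to the sum.
    """
    return sum(q * log2(q / p_star[a]) for a, q in p.items() if q > 0)


# Hypothetical true distribution over three letters (an assumption of this sketch).
p = {"a0": 0.5, "a1": 0.25, "a2": 0.25}
# Laplace estimates (count + 1) / (t + 3) after a word with counts 3, 0, 1.
laplace = {"a0": 4 / 7, "a1": 1 / 7, "a2": 2 / 7}

print(redundancy_bits(p, laplace))  # a small positive number of bits
print(redundancy_bits(p, p))        # 0.0: no redundancy when the prediction is exact
```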

For fixed $t$, $r_{\gamma,p}$ is a random variable, because
$x_1, x_2, \ldots, x_t$ are random variables. We define the average
redundancy at time $t$ by

$$r^t(p\|\gamma) = E\,(r_{\gamma,p}(\cdot))
 = \sum_{x_1 \cdots x_t \in A^t} p(x_1 \cdots x_t)\, r_{\gamma,p}(x_1 \cdots x_t). \qquad (2)$$

Related to this quantity we define the maximum average divergence (at
time $t$) by

$$r^t(M\|\gamma) = \sup_{p \in M} r^t(p\|\gamma), \qquad (3)$$

where $M$ is a set of sources. If the Laplace predictor

$$p^*_L(a|x_1 \cdots x_t) = (\nu^t(a) + 1)/(t + |A|) \qquad (4)$$

is applied to an i.i.d. source, its average redundancy $r^t(p\|L)$ is
upper bounded by $(|A| - 1)\log e/(t + 1)$, i.e.

$$r^t(M_0(A)\|L) \le (|A| - 1)\log e/(t + 1), \qquad (5)$$

see pm i2T|. (Here $e = 2.718\ldots$ is the Euler number and, as before,
$\nu^t(a)$ denotes the count of letter $a$ occurring in the word
$x_1 \ldots x_{t-1} x_t$.) In jS] Krichevsky investigated the problem of
the optimal predictor for the set of i.i.d. sources and showed that for
any predictor $\pi$ the maximal redundancy is asymptotically lower bounded
by $(|A| - 1)\log e/(2t)$:

$$\limsup_{t \to \infty}\, \big(2t\; r^t(M_0(A)\|\pi)\big) \ge (|A| - 1)\log e.$$



He has also suggested the predictor

$$p^*_K(a|x_1 \cdots x_t) = (\nu^t(a) + \delta)/(t + \delta|A|), \qquad (6)$$

where $\delta = 0.50922\ldots$, and shown that it is asymptotically optimal:

$$\limsup_{t \to \infty}\, \big(2t\; r^t(M_0(A)\|K)\big) = (|A| - 1)\log e. \qquad (7)$$



As we mentioned above, predictors for i.i.d. sources can be easily extended
to Markov sources (see, for example, jHllS]) and to general stationary and
ergodic sources, as was suggested in [THUHj. But it is worth noting that,
as is shown in [TH], there exist stationary and ergodic sources whose
divergence does not go to 0. More precisely, for any predictor $\gamma$
there exists a stationary and ergodic source $p$ such that
$\limsup_{t \to \infty} r^t(p\|\gamma) \ge \log|A|$ (see EJ CD]). On the
other hand, it is shown in [THl HI] that there exists a predictor $\rho$
such that the average $R^t(p\|\rho) = t^{-1}\sum_{i=1}^{t} r^i(p\|\rho)$
goes to 0 for any stationary and ergodic source $p$ as $t$ goes to
infinity. We will focus our attention on the per-letter redundancy
$r^t(\cdot\|\cdot)$, but all estimates can be easily extended to
$R^t(\cdot\|\cdot)$.

3 The new scheme 

Let, as before, $p$ be an i.i.d. source generating letters from the
alphabet $A$. The probability distribution $p(a)$, $a \in A$, is unknown
and each letter $x_{t+1}$ should be predicted (or encoded) using
information on the word $x_1 x_2 \cdots x_t$ only.

The suggested scheme can be applied to any predictor (or letterwise
coding), but we will use the Laplace predictor as the main example. We
start the description of the new scheme with a simple example. Let
$A = \{a_0, a_1, a_2\}$, $t = 4$ and $x_1 \ldots x_4 = a_0 a_2 a_0 a_0$.
The "common" Laplace predictor is as follows:

$p^*_L(x_5 = a_0) = (3 + 1)/(4 + 3) = 4/7$, $p^*_L(x_5 = a_1) = (0 + 1)/(4 + 3) = 1/7$,
$p^*_L(x_5 = a_2) = (1 + 1)/(4 + 3) = 2/7$,

see (4). In this example we suggest grouping the letters into two subsets
$A_0 = \{a_0, a_1\}$, $A_1 = \{a_2\}$ and carrying out the prediction in
two steps. First, the original sequence $a_0 a_2 a_0 a_0$ is represented
as $A_0 A_1 A_0 A_0$ and membership in the subsets is predicted as follows:

$p^*_L(x_5 \in A_0) = (3 + 1)/(4 + 2) = 2/3$, $p^*_L(x_5 \in A_1) = (1 + 1)/(4 + 2) = 1/3$.

We know that $A_1$ contains one letter ($a_2$), hence
$p^*_L(x_5 = a_2) = 1/3$. The sequence $A_0 A_1 A_0 A_0$, which contains
three letters $A_0$, is used for predicting the conditional probabilities
$p(x_5 = a_i|x_5 \in A_0)$, $i = 0, 1$. Again, we apply the Laplace
predictor (4) to the sequence $A_0 A_0 A_0 = a_0 a_0 a_0$ and obtain the
following prediction: $p(x_5 = a_0|x_5 \in A_0) = (3 + 1)/(3 + 2) = 4/5$,
$p(x_5 = a_1|x_5 \in A_0) = (0 + 1)/(3 + 2) = 1/5$. So, combining all
predictions, we obtain $p^*_L(x_5 = a_0) = (2/3)(4/5) = 8/15$,
$p^*_L(x_5 = a_1) = (2/3)(1/5) = 2/15$, $p^*_L(x_5 = a_2) = 1/3$. We can
see that this prediction and the "common" one are different.
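
The two-step computation of this example can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `laplace` and the string labels for letters and subsets are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def laplace(word, alphabet):
    """Laplace estimate (count + 1) / (length + alphabet size) for each symbol."""
    counts = Counter(word)
    return {a: Fraction(counts[a] + 1, len(word) + len(alphabet)) for a in alphabet}


word = ["a0", "a2", "a0", "a0"]
groups = {"A0": ["a0", "a1"], "A1": ["a2"]}
group_of = {a: g for g, letters in groups.items() for a in letters}

# Step 1: predict the subset from the sequence A0 A1 A0 A0.
p_group = laplace([group_of[x] for x in word], list(groups))

# Step 2: inside each subset, predict the letter from that subset's letters only,
# then multiply by the probability of the subset.
two_step = {}
for g, letters in groups.items():
    inside = laplace([x for x in word if group_of[x] == g], letters)
    for a in letters:
        two_step[a] = p_group[g] * inside[a]

for a in ("a0", "a1", "a2"):
    print(a, two_step[a])  # a0 8/15, a1 2/15, a2 1/3

common = laplace(word, ["a0", "a1", "a2"])  # 4/7, 1/7, 2/7 for comparison
```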

It will be convenient to describe the general case using the notation of a
tree. Let $T$ be a rooted tree which contains $|A|$ leaves, and let each
leaf be marked by one letter from $A$ in such a way that different leaves
are marked by different letters.




Figure 1: $A_{\alpha_1} = \{a_1, a_3, a_6\}$, $A_{\alpha_2} = \{a_2\}$, $A_{\alpha_3} = \{a_4, a_5\}$, $A_{\alpha_{1,1}} = \{a_3\}$, ... .



We will mark each vertex $\mu \in T$ by a subset $A_\mu$ of $A$ as follows.
We consider the subtree $T_\mu$ whose root is the vertex $\mu$ and define
the subset $A_\mu$ as the set of all letters which mark the leaves of the
subtree $T_\mu$. Note that $A_{root} = A$. We denote the vertices of depth
one by $\alpha_i$, $i = 1, \ldots, k$, the vertices of depth 2 by
$\alpha_{i,j}$, $j = 1, \ldots, k_i$, $i = 1, \ldots, k$, etc., where $k$
is the number of depth-1 vertices, $k_i$ is the number of sons of the
vertex $\alpha_i$, etc. The prediction is carried out as follows. First, a
generated sequence $x_1 \ldots x_t$ is represented as the sequence
$A_{\alpha_{i_1}} A_{\alpha_{i_2}} \ldots A_{\alpha_{i_t}}$, where
$\alpha_{i_j}$ is the vertex of depth one such that the letter $x_j$
belongs to the subset $A_{\alpha_{i_j}}$. Then
$A_{\alpha_{i_1}} A_{\alpha_{i_2}} \ldots A_{\alpha_{i_t}}$ is considered
as a sequence generated by an i.i.d. source with the alphabet
$\{A_{\alpha_i}, i = 1, \ldots, k\}$ and the next letter
$A_{\alpha_{i_{t+1}}}$ is predicted (say, by the Laplace predictor). In
fact, $p^*_L(A_{\alpha_i})$ is the estimate of
$\Pr(x_{t+1} \in A_{\alpha_i})$. Then, for each vertex $\alpha_{i,j}$,
$i = 1, \ldots, k$, $j = 1, \ldots, k_i$, of depth 2 which is not a leaf,
we estimate the probability
$\Pr(x_{t+1} \in A_{\alpha_{i,j}} | x_{t+1} \in A_{\alpha_i})$. For this
purpose, for each $i$ we consider all letters $A_{\alpha_i}$ in the
sequence $A_{\alpha_{i_1}} A_{\alpha_{i_2}} \ldots A_{\alpha_{i_t}}$ and
form the sequence of their sons
$A_{\alpha_{i,j_1}} \ldots A_{\alpha_{i,j_s}}$, whose length $s$ equals the
number of occurrences of $A_{\alpha_i}$ in the sequence
$A_{\alpha_{i_1}} A_{\alpha_{i_2}} \ldots A_{\alpha_{i_t}}$. This sequence
is considered as generated by an i.i.d. source and the next letter
$A_{\alpha_{i,j_{s+1}}}$ is predicted. As a result, we obtain
$p^*_L(A_{\alpha_{i,j}})$, which are, in fact, the estimates of the
probabilities $\Pr(x_{t+1} \in A_{\alpha_{i,j}} | x_{t+1} \in A_{\alpha_i})$.
And so on. Then we calculate the prediction $P^*_{L_T}(x_{t+1} = a)$ as the
product
$\Pr(x_{t+1} \in A_{\alpha_i}) \Pr(x_{t+1} \in A_{\alpha_{i,j}} | x_{t+1} \in A_{\alpha_i}) \Pr(x_{t+1} \in A_{\alpha_{i,j,m}} | x_{t+1} \in A_{\alpha_{i,j}}) \cdots$,
where $a \in A_{\alpha_i}$, $a \in A_{\alpha_{i,j}}$,
$a \in A_{\alpha_{i,j,m}}$, ....

The following example illustrates all steps. Let the alphabet $A$ be
$\{a_1, \ldots, a_6\}$ and let the tree $T$ be as shown in Figure 1. Let
the generated sequence $x_1 \ldots x_t$ be
$a_3 a_1 a_5 a_5 a_2 a_5 a_4 a_2 a_3$. According to the tree $T$, this
sequence is first represented as
$A_{\alpha_1} A_{\alpha_1} A_{\alpha_3} A_{\alpha_3} A_{\alpha_2} A_{\alpha_3} A_{\alpha_3} A_{\alpha_2} A_{\alpha_1}$
and the next letter $x_{t+1}$ is predicted as follows:
$P^*_{L_T}(x_{t+1} \in A_{\alpha_1}) = (3 + 1)/(9 + 3)$,
$P^*_{L_T}(x_{t+1} \in A_{\alpha_2}) = (2 + 1)/(9 + 3)$,
$P^*_{L_T}(x_{t+1} \in A_{\alpha_3}) = (4 + 1)/(9 + 3)$.
Then we divide the sequence of subsets into the three following
subsequences: $A_{\alpha_1} A_{\alpha_1} A_{\alpha_1}$,
$A_{\alpha_2} A_{\alpha_2}$ and
$A_{\alpha_3} A_{\alpha_3} A_{\alpha_3} A_{\alpha_3}$. The set
$A_{\alpha_2}$ contains only one letter $a_2$, hence the prediction of
this letter coincides with the prediction of the subset $A_{\alpha_2}$:
$P^*_{L_T}(x_{t+1} = a_2) = P^*_{L_T}(x_{t+1} \in A_{\alpha_2}) = (2 + 1)/(9 + 3)$.
The first subsequence $A_{\alpha_1} A_{\alpha_1} A_{\alpha_1}$ is
represented as $a_3 a_1 a_3$ and the next letter is predicted by
$P^*_{L_T}(x_{t+1} = a_1 | x_{t+1} \in A_{\alpha_1}) = (1 + 1)/(3 + 3)$,
$P^*_{L_T}(x_{t+1} = a_3 | x_{t+1} \in A_{\alpha_1}) = (2 + 1)/(3 + 3)$,
$P^*_{L_T}(x_{t+1} = a_6 | x_{t+1} \in A_{\alpha_1}) = (0 + 1)/(3 + 3)$.
The third subsequence
$A_{\alpha_3} A_{\alpha_3} A_{\alpha_3} A_{\alpha_3}$ is represented as
$a_5 a_5 a_5 a_4$ and we obtain the following prediction:
$P^*_{L_T}(x_{t+1} = a_4 | x_{t+1} \in A_{\alpha_3}) = (1 + 1)/(4 + 2)$,
$P^*_{L_T}(x_{t+1} = a_5 | x_{t+1} \in A_{\alpha_3}) = (3 + 1)/(4 + 2)$.
Finally, we obtain

$P^*_{L_T}(x_{t+1} = a_1) = (4/12)(2/6)$, $P^*_{L_T}(x_{t+1} = a_2) = (3/12)$,
$P^*_{L_T}(x_{t+1} = a_3) = (4/12)(3/6)$, $P^*_{L_T}(x_{t+1} = a_4) = (5/12)(2/6)$,

$P^*_{L_T}(x_{t+1} = a_5) = (5/12)(4/6)$, $P^*_{L_T}(x_{t+1} = a_6) = (4/12)(1/6)$.
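
The general tree scheme can be sketched recursively. In the Python sketch below the tree is encoded as nested lists with strings as leaves; this encoding, the function names and the reading of the garbled example sequence as $a_3 a_1 a_5 a_5 a_2 a_5 a_4 a_2 a_3$ are assumptions of the sketch, which uses the Laplace rule at every internal vertex.

```python
from fractions import Fraction


def leaves(node):
    """Set of letters marking the leaves below `node` (the subset A_mu)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def tree_laplace(word, node):
    """Predict the next letter according to the tree `node`.

    `word` must contain only letters that mark leaves of `node`.
    """
    if isinstance(node, str):
        return {node: Fraction(1)}
    prediction = {}
    for child in node:
        subset = leaves(child)
        sub_word = [x for x in word if x in subset]
        # Laplace estimate of Pr(next letter in A_child | next letter in A_node).
        p_child = Fraction(len(sub_word) + 1, len(word) + len(node))
        for letter, q in tree_laplace(sub_word, child).items():
            prediction[letter] = p_child * q
    return prediction


# Tree of Figure 1: the root has the sons {a1, a3, a6}, {a2} and {a4, a5}.
tree = [["a1", "a3", "a6"], "a2", ["a4", "a5"]]
word = ["a3", "a1", "a5", "a5", "a2", "a5", "a4", "a2", "a3"]
result = tree_laplace(word, tree)
for letter in sorted(result):
    print(letter, result[letter])
```

With this input the products agree with the values listed above, e.g. $(4/12)(2/6) = 1/9$ for $a_1$, and the six probabilities sum to 1.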

The properties of the suggested scheme are as follows.

Theorem 1. Let $x_1 \ldots x_t$ be a sequence generated by an i.i.d. source
from an alphabet $A$. If the letter $x_{t+1}$ is predicted by the
suggested scheme according to a tree $T$ with the set of vertices
$\Lambda$, then the following upper bound for the precision (or
redundancy) (2) is valid:

$$r^t(p\|L_T) \le \log e \sum_{\lambda \in \Lambda} \min\{(\sigma(\lambda) - 1)/(t + 1),\; p(A_\lambda)\}, \qquad (8)$$

where $\sigma(\lambda)$ is the number of sons of the vertex $\lambda$,
$p(A_\lambda) = \sum_{a \in A_\lambda} p(a)$, and $L_T$ denotes the
predictor.
Proof is in Appendix.

Comment. The summands which correspond to leaves in (8) are equal to 0,
because $|A_\lambda| = 1$ if $\lambda$ is a leaf. So, if we denote the set
of internal vertices by $\Lambda_{int}$, we can rewrite (8) as follows:

$$r^t(p\|L_T) \le \log e \sum_{\lambda \in \Lambda_{int}} \min\{(\sigma(\lambda) - 1)/(t + 1),\; p(A_\lambda)\}.$$

Corollary. The "common" Laplace predictor corresponds to the tree
which consists of a root with $|A|$ sons (leaves). From Theorem 1 we
immediately obtain the upper bound (5).

4 The unknown and infinite alphabet 

Of course, the suggested scheme can be generalized in such a way that the
tree $T$, which determines the algorithm of prediction, depends on the
encoded part $x_1 \ldots x_t$. We consider one such scheme, which is close
in spirit to some methods based on the estimate of an escape probability
[HI |2H1, and show how Theorem 1 can be used for obtaining asymptotic
estimates of the redundancy in the case where the algorithm of prediction
(or the tree $T$) depends on the sequence.

In order to describe this predictor we define the subset $A_0^t$, which
contains all 0-frequent letters at the moment $t$, and let
$A_+^t = A - A_0^t$. The predictor $\eta_t$ is defined as follows:

$$\eta_t(a) = \begin{cases}
(\nu^t(a) + 1)/(t + |A_+^t| + 1), & \text{if } \nu^t(a) > 0; \\
1/\big((t + |A_+^t| + 1)\,|A_0^t|\big), & \text{if } \nu^t(a) = 0.
\end{cases}$$
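
A minimal Python sketch of this predictor follows. It assumes a finite alphabet given as a list so that the unseen letters can be enumerated; the function name and the five-letter example are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def eta_predict(word, alphabet):
    """Escape-style predictor: seen letters get (count + 1) / (t + |A_+| + 1);
    the mass 1 / (t + |A_+| + 1) is shared equally among the unseen letters."""
    counts = Counter(word)
    t = len(word)
    seen = [a for a in alphabet if counts[a] > 0]      # the set A_+ at time t
    unseen = [a for a in alphabet if counts[a] == 0]   # the set A_0 at time t
    denom = t + len(seen) + 1
    prediction = {a: Fraction(counts[a] + 1, denom) for a in seen}
    for a in unseen:
        prediction[a] = Fraction(1, denom * len(unseen))
    return prediction


# Example: five-letter alphabet, word a0 a2 a0 a0 (two distinct letters seen).
alphabet = ["a0", "a1", "a2", "a3", "a4"]
eta = eta_predict(["a0", "a2", "a0", "a0"], alphabet)
for a in alphabet:
    print(a, eta[a])  # a0 4/7, a2 2/7, each unseen letter 1/21
```

Note that when every letter has already occurred, the share reserved for unseen letters is not assigned to any letter, so the values sum to less than 1 in that case.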



In words, the letter $x_{t+1}$ is encoded according to the tree $T_t$,
whose root has $|A_+^t| + 1$ sons. Of these, $|A_+^t|$ are leaves
corresponding to the letters from the set $A_+^t$ (i.e. their frequencies
are greater than 0). One son of the root corresponds to the set $A_0^t$
and has $|A_0^t|$ sons, which, in turn, correspond to the letters from
$A_0^t$. This code is close to some methods which are based on an estimate
of the escape probability, see [231. It is interesting that,
asymptotically, the increase of the redundancy is the same as if the
alphabet size had increased by one. In other words, if the suggested
scheme is applied to an $s$-letter source and $s$ is unknown, the
redundancy is the same as if the predictor were applied to an
$(s + 1)$-letter source with the alphabet size $(s + 1)$ known
beforehand. More precisely, the following property is valid.

Theorem 2. Let the predictor $\eta_t$ be applied to an i.i.d. source $p$
with an alphabet $A$. Then the following inequality is valid:

$$\limsup_{t \to \infty}\, \big(t\; r^t(p\|\eta_t)\big) \le \min\{s, |A| - 1\},$$

where $s$ is the number of the alphabet letters whose probability is not 0.
Proof is in Appendix.

Comment. The Krichevsky predictor (6) can be used along with the described
scheme (instead of the Laplace predictor). If we denote this predictor by
$\hat\eta_t$, then the following inequality is valid:

$$\limsup_{t \to \infty}\, \big(2t\; r^t(p\|\hat\eta_t)\big) \le \min\{s, |A| - 1\}.$$

If we compare these inequalities with the redundancy of the common
Laplace predictor (5) and the Krichevsky one (7), we can see that the
"payment" for the lack of knowledge of the alphabet size is asymptotically
as large as that caused by increasing the alphabet by 1 letter. In other
words, if the letters with nonzero probabilities are known beforehand, the
redundancy of the common Laplace predictor is $(s - 1)/(t + 1)$, where $s$
is the number of letters with nonzero probabilities.

Now we consider the case of an infinite alphabet
$A = \{a_1, a_2, \ldots\}$. We suppose that there is an i.i.d. source
which generates letters from $A$ with a probability distribution $p$.
Obviously, the first letter $x_1$ must be encoded by an encoder and
decoded by a decoder based on an initial code (over some finite alphabet)
which is known to both the encoder and the decoder. (Otherwise, the first
letter cannot be transmitted.) We denote this initial code by
$c = \{c_1, c_2, \ldots\}$ and suppose that the letter $a_i$ is encoded by
the word $c_i$. Of course, we also suppose that the code $c$ is uniquely
decodable. Moreover, we will suppose that $c$ is a prefix code, because it
is known that for each uniquely decodable code $c$ there exists a prefix
code $c^*$ with the same codeword lengths ($|c_i| = |c_i^*|$ for all $i$),
see jU].

If one wants to use this code for encoding and decoding the first letter
$x_1$, the average codeword length of this letter should be finite, i.e.

$$E_p(|c_i|) = \sum_{i=1}^{\infty} p(a_i)\,|c_i| < \infty. \qquad (9)$$

For example, if the (binary) code and the probability distribution are as
follows:

$c_1 = 0$, $c_2 = 10$, $c_3 = 110$, $c_4 = 1110$, ...; $p(a_1) = 1/2$, $p(a_2) = 1/4$, $p(a_3) = 1/8$, ...,

then the average codeword length $E_p(|c_i|) = 2$ bits and, hence, such a
code can be used for coding $x_1$. On the other hand, consider the same
probability distribution and a new code $c$ such that $|c_i| = 2^i$. This
code cannot be used, because its average codeword length is infinite and
the first letter $x_1$ cannot be transmitted from an encoder to a decoder.
So, our requirement is as follows: we do not know the source
probabilities, but know a code $c$ whose average codeword length is
finite. The following theorem shows that this requirement is sufficient
for the existence of an adaptive code whose redundancy goes to 0. In a
certain sense this requirement is necessary, because nobody can transmit
the first letter of the generated sequence if he/she does not know such a
code.
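
The two codes of this example can be compared through partial sums of (9). The following Python sketch assumes $p(a_i) = 2^{-i}$ for all $i$, as the listed probabilities suggest; the function names are illustrative.

```python
from fractions import Fraction


def partial_average_length(length_of, n):
    """Sum of p(a_i) * |c_i| over i = 1..n, assuming p(a_i) = 2**-i."""
    return sum(Fraction(length_of(i), 2**i) for i in range(1, n + 1))


def unary_length(i):
    """|c_i| = i for the code 0, 10, 110, 1110, ..."""
    return i


def exponential_length(i):
    """|c_i| = 2**i for the second code of the example."""
    return 2**i


for n in (10, 20, 40):
    print(n,
          float(partial_average_length(unary_length, n)),
          partial_average_length(exponential_length, n))
```

The partial sums for the first code equal $2 - (n + 2)/2^n$ and approach 2 bits, while for the second code every term equals 1, so the partial sum is $n$ and grows without bound.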

Theorem 3. Suppose that there are a source generating letters from an
infinite alphabet $A$ with (unknown) distribution $p$ and a prefix code
$c(a)$, $a \in A$, over some finite alphabet such that the average length
$E_p(|c(a)|)$ is finite. Denote by $T$ the tree which corresponds to the
code $c$ and let $L_T$ be the Laplace predictor corresponding to the tree
$T$. Then the redundancy of this method goes to 0:

$$\lim_{t \to \infty} r^t(p\|L_T) = 0.$$

Proof is in Appendix.

Corollary. If the source alphabet $A$ is infinite, but it is known that
only a finite number of letters have nonzero probabilities, the average
codeword length (9) is finite for any code and, hence, Theorem 3 is valid
for any prefix code.
5 Appendix



The proof of Theorem 1. We will use the two following lemmas.

Lemma 1. Let a Bernoulli source generate $t$ letters from the alphabet
$A^* = \{A, \bar{A}\}$ with probabilities $p(A)$ and $p(\bar{A})$ and let
$\theta(A)$ be the number of occurrences of the letter $A$ in the
generated sequence. Then the following inequality is valid:

$$p(A)\, E_p\big(1/(\theta(A) + 1)\big) \le \min\{p(A),\; 1/(t + 1)\},$$

where $E_p$ is an expectation.

Proof of Lemma 1. The inequality
$p(A)\, E_p(1/(\theta(A) + 1)) \le p(A)$ follows from the obvious
inequality $E_p(1/(\theta(A) + 1)) \le E_p(1) = 1$.

The second inequality $p(A)\, E_p(1/(\theta(A) + 1)) \le 1/(t + 1)$ is
proved as follows:

$$p(A)\, E_p\Big(\frac{1}{\theta(A) + 1}\Big)
 = p(A) \sum_{j=0}^{t} \frac{1}{j + 1} \binom{t}{j} p(A)^j (1 - p(A))^{t-j}
 = \frac{1}{t + 1} \sum_{j=0}^{t} \binom{t+1}{j+1} p(A)^{j+1} (1 - p(A))^{t+1-(j+1)}$$

$$\le \frac{1}{t + 1} \sum_{s=0}^{t+1} \binom{t+1}{s} p(A)^s (1 - p(A))^{t+1-s} = \frac{1}{t + 1}.$$

The lemma is proven.
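
Lemma 1 can also be checked numerically. The Python sketch below compares the left-hand side with the closed form $(1 - (1 - p(A))^{t+1})/(t + 1)$, which follows from the binomial identity used in the proof, and with the bound of the lemma; the function name and the sampled values of $p(A)$ and $t$ are illustrative assumptions.

```python
from math import comb


def lemma1_lhs(p, t):
    """p * E[1 / (theta + 1)] where theta has the Binomial(t, p) distribution."""
    return p * sum(
        comb(t, j) * p**j * (1 - p) ** (t - j) / (j + 1) for j in range(t + 1)
    )


for t in (1, 5, 50):
    for p in (0.01, 0.3, 0.9):
        lhs = lemma1_lhs(p, t)
        closed_form = (1 - (1 - p) ** (t + 1)) / (t + 1)
        bound = min(p, 1 / (t + 1))
        print(t, p, lhs, closed_form, bound)
```

In each row the first two computed values agree up to rounding and neither exceeds the bound.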

Lemma 2. Let a Bernoulli source generate $t$ letters from the finite
alphabet $A = \{a\}$ with probabilities $p(a)$, $a \in A$. Then the
redundancy of the "common" Laplace predictor (4) can be upper bounded by
$\log e\,(|A| - 1)/(t + 1)$.
(Though Lemma 2 is known for the Laplace predictor ([2H1E]), we give the
details of the proof for the convenience of the reader.)
We employ the general inequality

$$D(p\|\eta) \le \log e\, \Big(-1 + \sum_{a \in A} p(a)^2/\eta(a)\Big),$$

valid for any distributions $p$ and $\eta$ over $A$ (it follows from the
elementary inequality $\ln x \le x - 1$). From the definition of the
redundancy (or precision) (1), (2) and (4) we obtain

$$r^t(p\|L) = E_p\, D\big(p\,\|\,p^*_L(\cdot|x_1 \cdots x_t)\big)
 \le \log e\, \Big(-1 + \sum_{x_1 \cdots x_t \in A^t} p(x_1 \cdots x_t) \sum_{a \in A} \frac{p(a)^2\,(t + |A|)}{\nu^t(a) + 1}\Big)$$

$$= \log e\, \Big(-1 + (t + |A|) \sum_{a \in A} p(a)^2 \sum_{i=0}^{t} \frac{1}{i + 1} \binom{t}{i} p(a)^i (1 - p(a))^{t-i}\Big)$$

$$= \log e\, \Big(-1 + \frac{t + |A|}{t + 1} \sum_{a \in A} p(a) \sum_{i=0}^{t} \binom{t+1}{i+1} p(a)^{i+1} (1 - p(a))^{t-i}\Big)$$

$$\le \log e\, \Big(-1 + \frac{t + |A|}{t + 1} \sum_{a \in A} p(a) \sum_{j=0}^{t+1} \binom{t+1}{j} p(a)^j (1 - p(a))^{t+1-j}\Big)
 = \log e\; \frac{|A| - 1}{t + 1}.$$

The lemma is proven.

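Consider a letter $a \in A$ and let
$\alpha_{i_1}, \alpha_{i_1,i_2}, \ldots, \alpha_{i_1,i_2,\ldots,i_s}$ be a
sequence of vertices such that

$$A_{\alpha_{i_1}} \supset A_{\alpha_{i_1,i_2}} \supset \ldots \supset A_{\alpha_{i_1,i_2,\ldots,i_s}}$$

and $A_{\alpha_{i_1,i_2,\ldots,i_s}} = \{a\}$. The estimate of the
probability (or prediction) $p(x_{t+1} = a)$ can be represented as follows:

$$p^*_{L_T}(a) = \frac{\nu^t(A_{\alpha_{i_1}}) + 1}{t + \sigma(root)} \cdot
 \frac{\nu^t(A_{\alpha_{i_1,i_2}}) + 1}{\nu^t(A_{\alpha_{i_1}}) + \sigma(\alpha_{i_1})} \cdot
 \frac{\nu^t(A_{\alpha_{i_1,i_2,i_3}}) + 1}{\nu^t(A_{\alpha_{i_1,i_2}}) + \sigma(\alpha_{i_1,i_2})} \cdots,$$

where $\nu^t(A_\alpha)$ is the number of occurrences of letters from the
subset $A_\alpha$ in the sequence $x_1 \ldots x_t$. From this equality and
the definition of the redundancy we obtain the following equality: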

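$$r^t(p\|L_T) = \sum_{x_1 \cdots x_t} p(x_1 \cdots x_t) \Big(\sum_{a \in A} p(a) \log\big(p(a)/p^*_{L_T}(a)\big)\Big)
 = \sum_{x_1 \cdots x_t} p(x_1 \cdots x_t) \sum_{a \in A} p(a)\, \Big\{
 \log \frac{p(A_{\alpha_{i_1}})}{(\nu^t(A_{\alpha_{i_1}}) + 1)/(t + \sigma(root))}$$

$$+ \log \frac{p(A_{\alpha_{i_1,i_2}})/p(A_{\alpha_{i_1}})}{(\nu^t(A_{\alpha_{i_1,i_2}}) + 1)/(\nu^t(A_{\alpha_{i_1}}) + \sigma(\alpha_{i_1}))}
 + \log \frac{p(A_{\alpha_{i_1,i_2,i_3}})/p(A_{\alpha_{i_1,i_2}})}{(\nu^t(A_{\alpha_{i_1,i_2,i_3}}) + 1)/(\nu^t(A_{\alpha_{i_1,i_2}}) + \sigma(\alpha_{i_1,i_2}))} + \cdots$$

$$+ \log \frac{p(A_{\alpha_{i_1,\ldots,i_s}})/p(A_{\alpha_{i_1,\ldots,i_{s-1}}})}{(\nu^t(A_{\alpha_{i_1,\ldots,i_s}}) + 1)/(\nu^t(A_{\alpha_{i_1,\ldots,i_{s-1}}}) + \sigma(\alpha_{i_1,\ldots,i_{s-1}}))} \Big\}.$$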
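Grouping summands, we obtain

$$r^t(p\|L_T) = E\Big(\sum_{i_1} p(A_{\alpha_{i_1}}) \log \frac{p(A_{\alpha_{i_1}})}{(\nu^t(A_{\alpha_{i_1}}) + 1)/(t + \sigma(root))}\Big) + \cdots$$

$$+ \sum_{i_1, \ldots, i_{s-1}} p(A_{\alpha_{i_1,\ldots,i_{s-1}}})\, E\Big(\sum_{i_s} \frac{p(A_{\alpha_{i_1,\ldots,i_s}})}{p(A_{\alpha_{i_1,\ldots,i_{s-1}}})} \log \frac{p(A_{\alpha_{i_1,\ldots,i_s}})/p(A_{\alpha_{i_1,\ldots,i_{s-1}}})}{(\nu^t(A_{\alpha_{i_1,\ldots,i_s}}) + 1)/(\nu^t(A_{\alpha_{i_1,\ldots,i_{s-1}}}) + \sigma(\alpha_{i_1,\ldots,i_{s-1}}))}\Big) + \cdots,$$

where $E(\,)$ means an expectation. It is worth noting that such a grouping
is correct for the case of infinite $A$, because all values $E(\,)$ are
nonnegative.

Having taken into account this equality and Lemma 2, we can upper bound
the redundancy as follows:

$$r^t(p\|L_T) \le \log e\, \Big\{ \frac{\sigma(root) - 1}{t + 1} + \cdots
 + \sum_{i_1, \ldots, i_{s-1}} p(A_{\alpha_{i_1,\ldots,i_{s-1}}})\, E\Big(\frac{\sigma(\alpha_{i_1,\ldots,i_{s-1}}) - 1}{\nu^t(A_{\alpha_{i_1,\ldots,i_{s-1}}}) + 1}\Big) + \cdots \Big\}.$$

Let us estimate one summand, say,

$$\sum_{i_1} \sum_{i_2} \sum_{i_3} p(A_{\alpha_{i_1,i_2,i_3}})\, E\Big(\frac{\sigma(\alpha_{i_1,i_2,i_3}) - 1}{\nu^t(A_{\alpha_{i_1,i_2,i_3}}) + 1}\Big).$$

(The general case is obvious.) From Lemma 1 we obtain the following
inequalities:

$$\sum_{i_1} \sum_{i_2} \sum_{i_3} p(A_{\alpha_{i_1,i_2,i_3}})\, E\Big(\frac{\sigma(\alpha_{i_1,i_2,i_3}) - 1}{\nu^t(A_{\alpha_{i_1,i_2,i_3}}) + 1}\Big)
 = \sum_{i_1} \sum_{i_2} \sum_{i_3} (\sigma(\alpha_{i_1,i_2,i_3}) - 1)\; p(A_{\alpha_{i_1,i_2,i_3}})\, E\Big(\frac{1}{\nu^t(A_{\alpha_{i_1,i_2,i_3}}) + 1}\Big)$$

$$\le \sum_{i_1} \sum_{i_2} \sum_{i_3} (\sigma(\alpha_{i_1,i_2,i_3}) - 1) \min\Big\{p(A_{\alpha_{i_1,i_2,i_3}}),\; \frac{1}{t + 1}\Big\}.$$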

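From this estimate and the last upper bound for $r^t(p\|L_T)$ we obtain
the inequality

$$r^t(p\|L_T) \le \log e\, \Big\{ \frac{\sigma(root) - 1}{t + 1}
 + \sum_{i_1} (\sigma(\alpha_{i_1}) - 1) \min\Big\{p(A_{\alpha_{i_1}}),\; \frac{1}{t + 1}\Big\}$$

$$+ \sum_{i_1} \sum_{i_2} (\sigma(\alpha_{i_1,i_2}) - 1) \min\Big\{p(A_{\alpha_{i_1,i_2}}),\; \frac{1}{t + 1}\Big\}
 + \sum_{i_1} \sum_{i_2} \sum_{i_3} (\sigma(\alpha_{i_1,i_2,i_3}) - 1) \min\Big\{p(A_{\alpha_{i_1,i_2,i_3}}),\; \frac{1}{t + 1}\Big\} + \cdots \Big\}.$$

In fact, the last inequality is (8) and Theorem 1 is proven.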

The proof of Theorem 3. Let $\varepsilon > 0$ be some number. There exists
a finite subset $A_\varepsilon \subset A$ such that

$$\sum_{a \in (A - A_\varepsilon)} p(a)\,|c(a)| < \varepsilon/(2 \log e), \qquad (10)$$

because the sum $\sum_{a \in A} p(a)\,|c(a)|$ is finite. The redundancy of
the code $L_T$ can be represented as the two following summands:

$$r^t(p\|L_T) = \sum_{a \in A_\varepsilon} p(a) \log\big(p(a)/p^*_{L^t_T}(a)\big) + \sum_{a \in (A - A_\varepsilon)} p(a) \log\big(p(a)/p^*_{L^t_T}(a)\big), \qquad (11)$$

where $p^*_{L^t_T}$ is the Laplace probability assignment at the moment
$t$. The number of vertices in the subtree corresponding to
$A_\varepsilon$ is finite. Hence, if we apply Theorem 1, we obtain the
following upper bound for the first summand in (11):

$$\sum_{a \in A_\varepsilon} p(a) \log\big(p(a)/p^*_{L^t_T}(a)\big) \le const/(t + 1).$$

Hence, there exists $t_0$ such that the first summand in (11) is less than
$\varepsilon/2$ if $t > t_0$. Let us estimate the second summand in (11).
First we define the subset $\Lambda_\varepsilon$, which contains all
vertices belonging to paths from the root to each letter from
$A - A_\varepsilon$. Applying Theorem 1, we obtain

$$\sum_{a \in (A - A_\varepsilon)} p(a) \log\big(p(a)/p^*_{L^t_T}(a)\big) \le \log e \sum_{\lambda \in \Lambda_\varepsilon} p(A_\lambda) = \log e \sum_{\lambda \in \Lambda_\varepsilon} \Big(\sum_{a \in A_\lambda} p(a)\Big).$$

If we denote the number of vertices belonging to the path from the root to
the leaf $a$ by $\chi(a)$, we can rewrite the latest inequality as follows:

$$\sum_{a \in (A - A_\varepsilon)} p(a) \log\big(p(a)/p^*_{L^t_T}(a)\big) \le \log e \sum_{a \in (A - A_\varepsilon)} p(a)\,\chi(a).$$

By definition the codeword length $|c(a)|$ of any letter $a$ is equal to
$\chi(a)$. Hence, taking this fact into account, from the latest
inequality and (10) we obtain the following upper bound for the second
summand:

$$\sum_{a \in (A - A_\varepsilon)} p(a) \log\big(p(a)/p^*_{L^t_T}(a)\big) < \varepsilon/2.$$

So, each summand in (11) is upper bounded by $\varepsilon/2$, and this is
true for any positive $\varepsilon$. Theorem 3 is proven.